Abstract

A new procedure, called the DDα-procedure, is developed to solve the problem
of classifying d-dimensional objects into q ≥ 2 classes. The procedure is
completely nonparametric; it uses q-dimensional depth plots and a very
efficient algorithm for discrimination analysis in the depth space $[0,1]^q$.
Specifically, the depth is the zonoid depth, and the algorithm is the
α-procedure. In case of more than two classes several binary classifications
are performed and a majority rule is applied. Special treatments are discussed
for 'outsiders', that is, data having zero depth vector. The DDα-classifier is
applied to simulated as well as real data, and the results are compared with
those of similar procedures that have been recently proposed. In most cases
the new procedure has comparable error rates, but is much faster than other
classification approaches, including the SVM.

Keywords: Alpha-procedure, zonoid depth, DD-plot, pattern recognition,
supervised learning, misclassification rate



1 Introduction 

A steady interest in statistical learning theory has intensified recently since
nonparametric tools have become available. A new impetus has been given to
supervised classification by employing depth functions such as Tukey's [23]
halfspace depth or Liu's [17] simplicial depth. In supervised learning a
function is constructed from labeled training data that classifies an arbitrary
data point by assigning it one of the labels [10]. Given two or more labeled
clouds of training data in d-space, a data depth measures the centrality of a
point with respect to these clouds. For any point in d-space it indicates the
degree of closeness to each label. This can be employed in different ways for
solving the classification task.

Many authors have made use of data depth ideas in supervised classification.
Liu et al. [18] were the first to stress the usefulness and versatility of
depth transformations in multivariate analysis. They introduced the notion of a
DD-plot, that is, the two-dimensional representation of multivariate objects by
their data depths regarding two given classes. In a straightforward way, an
object can be classified to the class where it is deepest, that is, according
to its maximum depth. Jornsten [13] and Ghosh and Chaudhuri [9] have followed
this and similar approaches; see also Hoberg and Mosler [11]. Dutta and Ghosh
[4, 5] employ a separator that is linear in a density based on kernel estimates
of the projection depth and the $L_p$-depth, respectively. Recently, Li et al.
[16] have used polynomial separators of the DD-plot to classify objects by
their depth representation. These methods differ in the notion of depth used
and allow for adaptive and other extensions.

The quoted literature has in common that a (possibly high-dimensional) 
space of objects is transformed into a lower-dimensional space of depth values 
of these objects and the classification task is performed in the depth space. In 
this context several questions arise: 

1. Which particular notion of depth should be employed? 

2. Which classification procedure should be applied to the depth-represented 
data? 

3. How does the procedure extend to q > 2 classes?

The above literature answers these questions in different ways. Ad (1),
halfspace and simplicial depths, among others, have been employed in [9, 13, 16].
They depend only on the combinatorial structure of the data, being constant
in the compartments spanned by them. Consequently, these depths are rather
robust to outlying data, but calculating them in higher dimensions can be
cumbersome if not impossible. On the other hand, Mahalanobis depth [19], which
has also been used by these authors, is easily calculated but highly
non-robust. Moreover, it depends on the first two moments only and does not
reflect any asymmetries of the data. More robust forms of the Mahalanobis depth
still remain insensitive to data asymmetries. The $L_1$-depth as used in [13]
has similar drawbacks. The authors of [5] employ $L_p$-depths, which are easily
calculated if p is known, and choose p in an adaptive procedure; however, the
latter needs heavy computations. In [11] the maximum zonoid depth and a
combination of it with the Mahalanobis depth are used; both can be efficiently
calculated also in high dimensions but lack robustness. Ad (2), Li et al. [16]
solve the classification problem of the DD-plot by designing a polynomial line
(up to order three) that separates the unit square. The coefficients of the
polynomial are selected by cross-validation, searching for a minimal average
misclassification rate (AMR). The same is done in [4] and [5].

Ad (3), with q > 2 classes a given point is usually classified in two steps
according to a majority rule: firstly, $\binom{q}{2}$ classifications are
performed that are restricted to pairs of classes in the object space, and
secondly, the point is assigned to that class where it was most often assigned
in step 1.

In this paper, ad (1), we employ the zonoid depth [H] [5D], as it can be
efficiently calculated also in higher dimensions (up to d = 20 and more) and
has excellent theoretical properties regarding continuity and statistical
inference. However, the zonoid depth has a low breakdown point. If, in a
concrete application, robustness is an issue, the data have to be preprocessed
by some outlier detection procedure. Ad (2), for final classification in the
depth space a variant of the α-procedure is employed. It operates simply and
very efficiently on low-dimensional spaces like the depth spaces considered
here. The α-procedure was originally developed by Vasil'ev [55] and Lange
[37]. Ad (3), we employ DD-plots if there are two classes and q-dimensional
depth plots if there are q > 2 classes. Assignment of a given point to a class
is based on $\binom{q}{2}$ binary classifications in the q-dimensional depth
space plus a majority rule. Note that in each binary classification the whole
depth information regarding all q classes is used.

We call our approach the DDα-approach and apply it to simulated as well
as real data. The results are contrasted with those obtained in [16], [4], and [5].

The contribution of this paper is threefold. A classification procedure is
proposed that

1. is efficiently computable for objects of higher dimensions,

2. employs a very fast classification procedure of the D-transformed data,

3. uses the full multivariate information when classifying into q > 2 classes.

The rest of the paper is organized as follows. Section 2 introduces the
depth transform, which maps the data from d-dimensional object space to
q-dimensional depth space, and provides a first discussion of the problem of
'outsiders', that is, points having a vanishing depth vector. In Section 3 our
modification of the α-procedure is presented in some detail. Section 4 provides
a number of theoretical results regarding the behavior of the DDα-procedure on
elliptical and mirror symmetric distributions. Section 5 contains extensive
simulation results and comparisons. Calculations of real data benchmark
examples are reported in Section 6 as well as a comparison of the DDα-procedure
with the SVM approach. Section 7 concludes.

2 Depth transform 

A data depth is a function that measures, in a certain sense, how close a given
point x is located to the "center" of a finite set X in $\mathbb{R}^d$, that
is, how "deep" it is in the set. More precisely, a data depth is a function

$$(x, X) \mapsto D_X(x) \in [0,1], \qquad x \in \mathbb{R}^d,\ X \subset \mathbb{R}^d,$$

that satisfies the following restrictions: it is affine invariant; upper
semicontinuous in x; quasiconcave in x (that is, having convex upper level
sets); and vanishing if $\|x\| \to \infty$. Sometimes two weaker restrictions
are imposed: orthogonal invariance; and decrease on rays from a point of
maximal depth (that is, starshapedness of the upper level sets). For surveys of
these restrictions and many special notions of data depth, see e.g.
[251 |2U] |5] [2T] [2] .

Now, assume that data in $\mathbb{R}^d$ are to be classified into q ≥ 2 classes
and that $X_1, \dots, X_q \subset \mathbb{R}^d$ are training sets for these
classes, each having finite size $n_j = |X_j|$. Let D be a data depth. The
function $\mathbb{R}^d \to [0,1]^q$ mapping

$$x \mapsto d := (D_{X_1}(x), \dots, D_{X_q}(x)) \qquad (1)$$

will be referred to as the depth representation. Each object is represented by
a vector whose q components indicate its depth or closeness regarding the q
classes. In particular, the training sets $X_j \subset \mathbb{R}^d$ are
transformed to sets in $[0,1]^q$ that represent the classes in the depth space.
It should be noted that 'closeness' of points in the object space translates to
'closeness' of their representations. The classification problem then becomes
one of partitioning the depth space $[0,1]^q$ into q parts.

A simple rule, e.g., is to classify a point to that class where it has the
largest depth value; see [9, 13]. This means that the depth space decomposes
into q compartments which are separated by (parts of) q bisecting hyperplanes.
Maximum depth classification is a linear rule. A nonlinear classification rule
is used in Li et al. [16], who treat the case q = 2 by constructing a
polynomial line up to degree 3 that separates the depth space $[0,1]^2$; see
also [4, 5].
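The depth representation (1) and the maximum depth rule can be sketched in a few lines. The following is a minimal illustration only: it swaps in the Mahalanobis depth with moment estimates, in its usual form 1/(1 + squared Mahalanobis distance), because that depth has a closed form, whereas the zonoid depth used in this paper requires a linear program. The sketch is restricted to d = 2, and all function names are hypothetical.

```python
from statistics import fmean


def mahalanobis_depth_2d(x, cloud):
    """Mahalanobis depth of x w.r.t. a 2-D cloud, using moment estimates.

    Stand-in for the zonoid depth (assumption made for brevity)."""
    n = len(cloud)
    m0 = fmean(p[0] for p in cloud)
    m1 = fmean(p[1] for p in cloud)
    # Sample covariance matrix [[s00, s01], [s01, s11]].
    s00 = sum((p[0] - m0) ** 2 for p in cloud) / (n - 1)
    s11 = sum((p[1] - m1) ** 2 for p in cloud) / (n - 1)
    s01 = sum((p[0] - m0) * (p[1] - m1) for p in cloud) / (n - 1)
    det = s00 * s11 - s01 ** 2
    u, v = x[0] - m0, x[1] - m1
    # Quadratic form (x - m)' S^{-1} (x - m) with the 2x2 inverse written out.
    dist2 = (s11 * u * u - 2.0 * s01 * u * v + s00 * v * v) / det
    return 1.0 / (1.0 + dist2)


def depth_representation(x, clouds, depth=mahalanobis_depth_2d):
    """Map x to its depth vector (D_{X_1}(x), ..., D_{X_q}(x)), cf. (1)."""
    return tuple(depth(x, cloud) for cloud in clouds)


def max_depth_class(x, clouds, depth=mahalanobis_depth_2d):
    """Index (0-based) of the class in which x is deepest."""
    d = depth_representation(x, clouds, depth)
    return max(range(len(d)), key=d.__getitem__)
```

Any other depth with the same call signature could be passed as the `depth` argument; the classification step itself only sees the depth vector.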

With several important notions of data depth, $D_X(x)$ vanishes outside the
convex hull of X. This is, e.g., the case with the halfspace, simplicial, and
zonoid depths, but not with the Mahalanobis and $L_p$-depths. A point that lies
outside the convex hulls of all training sets is then mapped to the origin in
the depth space. Such a point will be referred to as an outsider. Of course, it
can be neither regarded as correctly classified nor ignored. To classify this
point we may consider three principal approaches, each allowing for several
variants.

• Classify randomly, either with probabilities proportional to the sizes of
classes or, if these probabilities are unknown, with equal probabilities.

• Use the k-nearest neighbors method with a properly chosen distance:
Euclidean distance, $L_p$-distance, Mahalanobis distance with moment
estimates, Mahalanobis distance with robust estimates (MCD, cf. e.g. [12]).

• Classify with maximum Mahalanobis depth (using moment estimates or
MCD) or with the maximum of another depth that is properly extended
beyond the convex hull as e.g. in [11].

In the sequel we will use either random classification, k-nearest neighbors
(with different distances), or maximum Mahalanobis depth (with moment and
robust estimates).
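The first of these treatments, random assignment, is simple enough to sketch. The code below is an illustrative assumption about how outsiders might be routed around the depth-space classifier, not the implementation used for the experiments; `inner_rule` stands for whatever classifier operates on non-zero depth vectors, and all names are hypothetical.

```python
import random


def is_outsider(depth_vector, tol=0.0):
    """An outsider has a vanishing depth vector (all components zero)."""
    return all(d <= tol for d in depth_vector)


def classify_outsider_randomly(class_sizes, rng=random, proportional=True):
    """Random class index, with probabilities proportional to the class
    sizes or, if proportional is False, with equal probabilities."""
    q = len(class_sizes)
    weights = list(class_sizes) if proportional else [1] * q
    return rng.choices(range(q), weights=weights, k=1)[0]


def classify(depth_vector, class_sizes, inner_rule, rng=random):
    """Apply inner_rule in the depth space; fall back to random assignment
    for outsiders, which carry no depth information."""
    if is_outsider(depth_vector):
        return classify_outsider_randomly(class_sizes, rng)
    return inner_rule(depth_vector)
```

Passing a seeded `random.Random` instance as `rng` makes the assignment of outsiders reproducible across runs.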

3 The α-procedure

To separate the q classes in the multi-depth space we use the α-procedure, which
has been developed by Vasil'ev [25JI215] and Lange [27], see also [15].

Let us first present the procedure in the case of q = 2 classes. As above,
consider two clouds of training data in $\mathbb{R}^d$,
$X = \{x_1, \dots, x_{n_1}\}$ and $Y = \{y_1, \dots, y_{n_2}\}$, and notate
$x_{n_1+m} = y_m$, $m = 1, \dots, n_2$. By calculating the depth of all $x_i$
with respect to each of the two clouds, their depth representation,
$(D_X(x_i), D_Y(x_i))$, is obtained, $i = 1, 2, \dots, n_1 + n_2$. The set

$$\mathcal{D} = \{ d_i \in \mathbb{R}^2 \mid d_i = (D_X(x_i), D_Y(x_i)),\ i = 1, \dots, n_1 + n_2 \}$$

is the DD-plot of the data [18].

We use a modified version of the α-procedure to construct a nonlinear
separator in $[0,1]^2$ that classifies the D-represented data points. The
construction is based on depth values and the products of depth values up to
some degree p. For this, a linearized representation of the two classes in a
depth feature space is produced as follows: If p = 2, say, we include the
squared depth values and the product of the two depths as additional
components of the representation, yielding an extended D-representation in
$[0,1]^5$,

$$Z = \{ z_i \mid z_i = (D_X(x_i),\ D_Y(x_i),\ D_X(x_i) \cdot D_Y(x_i),\ D_X^2(x_i),\ D_Y^2(x_i)),\ i = 1, \dots, n_1 + n_2 \}.$$

Each element of the extended D-representation is referred to as a basic
D-feature and the space $[0,1]^r$ as the feature space. When the maximum
exponent is $p \ge 1$, $z_i$ is a vector in $\mathbb{R}^r$ having components

$$D_X(x_i)^{k_\nu} \cdot D_Y(x_i)^{\ell_\nu}, \quad \text{where } 1 \le k_\nu + \ell_\nu \le p, \quad \nu = 1, \dots, r. \qquad (2)$$

The number of basic D-features, that is, the dimension of the feature space,
equals $r = \binom{p+2}{2} - 1$, which is easily seen by induction. We index
the basic D-features by ν and notate $z_i = (z_{i\nu})_{\nu = 1, \dots, r}$.
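The enumeration of basic D-features in (2) can be made concrete as follows. This is a small sketch under the assumption that the exponents are non-negative integers; the ordering of the features is arbitrary here (by total degree) and need not match the ordering used in the text, and the function names are hypothetical.

```python
from math import comb


def feature_exponents(p):
    """All exponent pairs (k, l) of non-negative integers with
    1 <= k + l <= p, ordered by total degree."""
    return [(k, s - k) for s in range(1, p + 1) for k in range(s, -1, -1)]


def extended_representation(dx, dy, p):
    """Basic D-features D_X^k * D_Y^l of one point with depths (dx, dy)."""
    return [dx ** k * dy ** l for k, l in feature_exponents(p)]


# The count agrees with r = C(p + 2, 2) - 1 for the first few degrees.
for p in range(1, 6):
    assert len(feature_exponents(p)) == comb(p + 2, 2) - 1
```

For p = 2 this yields the five features D_X, D_Y, D_X^2, D_X D_Y, D_Y^2, in that order.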

The α-procedure now, in a stepwise way, performs linear discrimination
in subspaces of the feature space. It is a bottom-up approach that
successively builds new features from the basic D-features. In each step
certain two-dimensional subspaces of Z are considered, and the projection of Z
to each of these subspaces is separated by a straight discrimination line. Out
of these subspaces the α-procedure selects a subspace whose discrimination line
provides the least average misclassification rate. Clearly any discrimination
line that separates the DD-plot must pass through the origin since
$D_X(x_i) = D_Y(x_i) = 0$ implies that the point $x_i$ cannot be classified to
either of the two classes. The same must hold for all discrimination lines in
subspaces of the extended depth space.



In a first step a pair $(\nu_1, \nu_2)$ of D-features (2) is chosen with
$(k_1 + k_2)(\ell_1 + \ell_2) > 0$. The latter restriction implies that the two
D-features do not solely relate to one of the classes. A straight
discrimination line is calculated in the two-dimensional coordinate subspace
defined by the pair $(\nu_1, \nu_2)$. As the line passes through the origin it
is characterized by an angle $\alpha \in [0, 2\pi[$. The best discriminating
angle $\alpha_{\nu_1,\nu_2}$ is determined by minimizing the average
classification error,

$$\Delta(\alpha; \nu_1, \nu_2) = \frac{1}{n_1 + n_2} \Big[ \sum_{i=1}^{n_1} I(z_{i\nu_1} \cos\alpha - z_{i\nu_2} \sin\alpha < 0) + \sum_{i=n_1+1}^{n_1+n_2} I(z_{i\nu_1} \cos\alpha - z_{i\nu_2} \sin\alpha > 0) \Big]. \qquad (3)$$

Here $I(A)$ denotes the indicator function of A. If the minimum is attained in
an interval, its middle value is selected for $\alpha_{\nu_1,\nu_2}$; see
Figure 1. The same is done for all pairs of D-features satisfying the above
restriction, and the pair $(\nu_1^*, \nu_2^*)$ is selected that minimizes (3).
If the minimum is not unique the pair with the smallest k and ℓ is chosen. Let
$\alpha^{(1)} = \alpha_{\nu_1^*,\nu_2^*}$ and denote the respective average
classification error by $\Delta^{(1)}$. Next the D-features $\nu_1^*$ and
$\nu_2^*$ are replaced by a new D-feature which is indexed by $\mu_1$ and gives
value

$$z_{i\mu_1} = z_{i\nu_1^*} \cos\alpha^{(1)} - z_{i\nu_2^*} \sin\alpha^{(1)}, \quad i = 1, \dots, n_1 + n_2, \qquad (4)$$

to each $x_i$. Geometrically the values are obtained by projecting
$(z_{i\nu_1^*}, z_{i\nu_2^*})$ to a straight line in the
$(\nu_1^*, \nu_2^*)$-plane that is perpendicular to the discrimination line;
see Figure 1. The first step results in the new D-feature $\mu_1$ and the
classification error $\Delta^{(1)}$ produced by classifying according to this
feature.

Figure 1: α-procedure; step 1.
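The error criterion (3) and the projection (4) for a single pair of D-features can be sketched as below. Note the assumption: the minimizing angle is located by a plain grid search over [0, 2π), which is a simplification chosen for brevity and not necessarily how the procedure is implemented; a minimizing interval that wraps around 2π is not merged. The function names and the grid size are hypothetical.

```python
import math


def classification_error(alpha, z1, z2, n1):
    """Average classification error (3) for angle alpha; the first n1
    points belong to the first class, the remaining ones to the second."""
    c, s = math.cos(alpha), math.sin(alpha)
    proj = [a * c - b * s for a, b in zip(z1, z2)]
    errors = sum(v < 0 for v in proj[:n1]) + sum(v > 0 for v in proj[n1:])
    return errors / len(proj)


def best_angle(z1, z2, n1, grid=3600):
    """Grid search over [0, 2*pi). Returns the middle of the first run of
    grid angles attaining the minimum error, together with that error."""
    angles = [2.0 * math.pi * j / grid for j in range(grid)]
    errs = [classification_error(a, z1, z2, n1) for a in angles]
    best = min(errs)
    start = errs.index(best)
    end = start
    while end + 1 < grid and errs[end + 1] == best:
        end += 1
    return 0.5 * (angles[start] + angles[end]), best


def new_feature(alpha, z1, z2):
    """Values of the new D-feature according to (4)."""
    c, s = math.cos(alpha), math.sin(alpha)
    return [a * c - b * s for a, b in zip(z1, z2)]
```

In a full first step, `best_angle` would be called for every admissible pair of basic D-features, and the pair with the smallest error would be replaced by the output of `new_feature`.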

The second step couples the new D-feature $\mu_1$ with each of the basic
D-features ν that have not been replaced so far. For each of these pairs of
D-features a best discriminating angle is determined, and among these the pair
of D-features is selected that provides the minimum average classification
error. The minimum error is denoted by $\Delta^{(2)}$ and the angle at which it
is attained by $\alpha^{(2)}$. This is visualized in Figure 2. The best pair of
D-features is replaced by a new D-feature $\mu_2$, where the values
$z_{i\mu_2}$ are calculated as in (4).

The last step is repeated with $\mu_2$ in place of $\mu_1$, etc. The procedure
stops after step t if either the additional discriminating power
$\Delta^{(t-1)} - \Delta^{(t)} = 0$ or t = r, that is, all basic D-features
have been replaced. Then the angle $\alpha^{(t)}$ defines a linear rule for
discriminating between two (up to) p-th order polynomials in $D_X(z)$ and
$D_Y(z)$, which correspond to the two finally constructed D-features,
according to their sign. This yields a polynomial separation of the classes in
the depth space.

Figure 2: α-procedure; step 2.

For example, let in step 1 the basic features $D_X$ and $D_Y$ be selected and,
consequently, $D_X \cdot D_Y$ and $D_X^2$ be included in steps 2 and 3. If the
procedure terminates after step 3, the result is a polynomial in the two depths
$D_X(x)$ and $D_Y(x)$ that has the form

$$a D_X(x) + b D_X^2(x) + c D_Y(x) + d D_X(x) D_Y(x).$$

A given point x of the object space then is classified according to the sign of
the polynomial.
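To see where the coefficients of such a polynomial come from, the three steps of this example can be unrolled. The following derivation is a sketch that assumes every step combines the most recent new feature (first coordinate) with the newly included basic feature (second coordinate) in the same way as (4); the angles $\alpha^{(1)}, \alpha^{(2)}, \alpha^{(3)}$ are those selected in steps 1 to 3.

```latex
\begin{align*}
z_{\mu_1} &= D_X \cos\alpha^{(1)} - D_Y \sin\alpha^{(1)}, \\
z_{\mu_2} &= z_{\mu_1} \cos\alpha^{(2)} - D_X D_Y \sin\alpha^{(2)}, \\
z_{\mu_3} &= z_{\mu_2} \cos\alpha^{(3)} - D_X^2 \sin\alpha^{(3)} \\
          &= \underbrace{\cos\alpha^{(1)}\cos\alpha^{(2)}\cos\alpha^{(3)}}_{a}\, D_X
             \;\underbrace{-\,\sin\alpha^{(3)}}_{b}\, D_X^2
             \;\underbrace{-\,\sin\alpha^{(1)}\cos\alpha^{(2)}\cos\alpha^{(3)}}_{c}\, D_Y
             \;\underbrace{-\,\sin\alpha^{(2)}\cos\alpha^{(3)}}_{d}\, D_X D_Y .
\end{align*}
```

Under this assumption the sign of $z_{\mu_3}$ is the classification rule, so the separator is determined by the three angles alone.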

If there are more than two classes, say $X_1, \dots, X_q$, each data point
$x_i$ is represented by the vector of depth values
$d_i = (D_{X_1}(x_i), \dots, D_{X_q}(x_i))$ in $[0,1]^q$. Again a depth feature
space is considered of some order p; it has dimension
$r = \binom{p+q}{q} - 1$. With q > 2 classes each set $X_j$ is separated from
the remaining ones by the α-procedure in the same way as above: In each step a
pair of D-features is replaced by a new D-feature as long as the average
classification error decreases and basic D-features are left to be replaced.
For each pair of classes the procedure results in a hypersurface that separates
the q-dimensional depth space into two sets of attraction. A given point x is
finally assigned to that class to which it has been most often attracted.
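The final majority rule over pairwise classifications can be sketched as follows. The pairwise rules are passed in as plain functions; in the example they are maximum-depth comparisons, used here only as a stand-in for the pairwise separators produced by the α-procedure. Breaking ties by the smallest class index is an assumption of this sketch, and all names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority_vote(depth_vector, pairwise_rules):
    """pairwise_rules maps each pair (j, k), j < k, to a function that
    takes the full depth vector and returns either j or k."""
    votes = Counter(rule(depth_vector) for rule in pairwise_rules.values())
    top = max(votes.values())
    # Assumption: ties are broken in favor of the smallest class index.
    return min(c for c, v in votes.items() if v == top)


def pairwise_max_depth_rules(q):
    """Stand-in pairwise rules: vote for the class with the larger depth."""
    return {
        (j, k): (lambda d, j=j, k=k: j if d[j] >= d[k] else k)
        for j, k in combinations(range(q), 2)
    }
```

For q classes the dictionary holds q(q-1)/2 rules, one per pair of classes, and each rule receives the whole q-dimensional depth vector.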
4 Some theoretical aspects



In order to investigate some properties of the DDα-approach we transfer it to
a more general probabilistic setting and define a depth function as the
population version of a data depth. Let $\mathcal{P}$ be a properly chosen set
of probability distributions on $\mathbb{R}^d$ that includes the empirical
distributions. A depth function D is a function that assigns a value
$D_P(x) \in [0,1]$ to every $x \in \mathbb{R}^d$ and $P \in \mathcal{P}$ in an
affine invariant way and has convex compact upper level sets. Obviously, the
restriction of a depth function D to the class of empirical distributions is an
affine invariant quasiconcave data depth. For details on general depth
functions, see e.g. the above cited surveys [2j [2Ql EU [28] .

While data depth is an intrinsically nonparametric notion, the behavior of
depth functions and depth based procedures on parametric classes is of special
interest as it indicates how the nonparametric approach relates to the more
classical parametric one. As a generalization of multivariate Gaussian
distributions, spherical and elliptical distributions play an important role in
parametric multivariate analysis. A random vector X in $\mathbb{R}^d$ has a
spherical distribution if $X = R \cdot U$, where U is a random vector uniformly
distributed on the sphere $S^{d-1}$ and R is a random variable having support
$[0, \infty[$ and being independent of U. A random vector Y has an elliptical
distribution if it is an affine transform of a spherically distributed X,
$Y = \mu + BX$. If R has a density r we notate
$Y \sim \mathrm{Ell}(\mu, BB', r)$. As, by definition, a depth function is
affine invariant, it operates on elliptical distributions in a rather simple
way. The following propositions give some insight into the behavior of depth
functions and the DDα-procedure if the data generating processes are
elliptical.
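The construction $Y = \mu + B \cdot R \cdot U$ translates directly into a sampler. The sketch below is an illustration with hypothetical function names; it uses the standard fact that a normalized standard Gaussian vector is uniformly distributed on the unit sphere, and it takes the radial law as a user-supplied function `radial(rng)`.

```python
import math
import random


def sample_spherical(d, radial, rng):
    """One draw X = R * U: U uniform on the unit sphere in R^d (a
    normalized standard Gaussian vector), R drawn by radial(rng)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    r = radial(rng)
    return [r * v / norm for v in g]


def sample_elliptical(mu, B, radial, rng):
    """One draw Y = mu + B X with X spherical; B is a list of rows."""
    x = sample_spherical(len(B[0]), radial, rng)
    return [m + sum(b * xi for b, xi in zip(row, x)) for m, row in zip(mu, B)]


def chi_radial(d):
    """Radial law of the standard Gaussian in R^d: R^2 is chi-square with
    d degrees of freedom, i.e. Gamma(shape d/2, scale 2)."""
    return lambda rng: math.sqrt(rng.gammavariate(d / 2.0, 2.0))
```

With `chi_radial(d)` as the radial law, `sample_elliptical(mu, B, ...)` draws from the Gaussian distribution with mean `mu` and covariance `BB'`; heavier-tailed radial laws give other members of the elliptical family.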

Proposition 1 If D is an affine invariant depth function and P an elliptical
distribution, then for every $\alpha \in\, ]0, 1]$ the upper level set

$$D_\alpha(P) = \{x \in \mathbb{R}^d \mid D_P(x) \ge \alpha\}$$

is an ellipsoid.

Proof. Let $P = \mathrm{Ell}(\mu, BB', r)$ and $\alpha \in\, ]0, 1]$. Consider
$P_0 = \mathrm{Ell}(0, I_d, r)$. Then, for all $\beta \ge \alpha$,
$\{x \in \mathbb{R}^d \mid D_{P_0}(x) = \beta\}$ is a sphere since D is, in
particular, orthogonal invariant. Hence,
$D_\alpha(P_0) = \{x \in \mathbb{R}^d \mid D_{P_0}(x) \ge \alpha\}$ is a ball
and, by affine transformation with μ and B, $D_\alpha(P)$ is an ellipsoid. □

Proposition 2 (i) Let D be the zonoid depth and P a unimodal elliptical
distribution, that is, $P = \mathrm{Ell}(\mu, BB', r)$, with density f. Then
for every non-empty density level set
$\{x \in \mathbb{R}^d \mid f(x) \ge \beta\}$ there exists some
$\alpha = \phi(\beta)$ such that

$$\{x \in \mathbb{R}^d \mid f(x) \ge \beta\} = D_\alpha(P).$$

(ii) If, in addition, r has an interval support then φ is a continuous,
strictly increasing function. It holds $D_P(x) = \phi(f(x))$ and therefore

$$f(x) \ge f(y) \iff D_P(x) \ge D_P(y). \qquad (5)$$
Proof. (i): Note that $D_0 = \mathbb{R}^d$. Thus, if $\beta \le 0$, the claim
holds with α = 0. Now let β > 0 and assume w.l.o.g. that P is spherical. Then
$\{x \in \mathbb{R}^d \mid f(x) \ge \beta\}$ is a ball with center at the
origin. Let $x^*$ be a point on its surface. Also the central regions
$D_\alpha$ are balls around the origin. By Theorems 3.9 and 3.14 in [20], the
$D_\alpha$ are continuous and strictly decreasing on the convex hull of the
support of P and it holds $\alpha^* := D_P(x^*) > 0$. We conclude
$D_{\alpha^*} = \{x \in \mathbb{R}^d \mid f(x) \ge \beta\}$.
(ii): Under the additional premise, the density level sets are continuously and
strictly decreasing in β > 0, which yields the result. □

Corollary 1 Consider a mixture of unimodal elliptical distributions
$P_j = \mathrm{Ell}(\mu_j, B_j B_j', r_j)$, $j = 1, \dots, q$, with mixing
probabilities $\pi_j$ and assume that all $r_j$ have an interval support. Let D
be the zonoid depth.

Then, for each j and k there exists a strictly increasing function
$\psi_{jk}$ so that



Proof. From Proposition 2 continuous and strictly increasing functions
$\phi_j$ and $\phi_k$ are obtained with $D_{P_j}(x) = \phi_j(f_j(x))$ and
$D_{P_k}(x) = \phi_k(f_k(x))$. Consequently,



which proves the claim by use of the function
$\psi_{jk}(\cdot) = \phi_j(\phi_k \dots$ □

A similar result holds for other data depths including the halfspace,
simplicial, projection and Mahalanobis depths; see Prop. 1 in [16]. In the rest
of this section we consider the limit behavior of the DDα-procedure under
independent sampling. For this, we assume that the empirical depth is a
consistent estimator of its population version. This is particularly true for
the zonoid, halfspace, simplicial, projection and Mahalanobis depths.

Theorem 1 (Bayes rule) Let F and G be probability distributions in
$\mathbb{R}^d$ and H be a hyperplane such that G is the mirror image of F with
respect to H. Then based on a 50:50 independent sample from F and G the
DDα-procedure will asymptotically yield the linear separator that corresponds
to the bisecting line of the DD-plot.

Note that the rule given in the theorem corresponds to the Bayes rule, see [16].

Proof. Due to the mirror symmetry of the distributions in $\mathbb{R}^d$ the
DD-plot is symmetric as well. The symmetry axis is the bisector, which is
obviously the result of the α-procedure when the sample is large enough. □

Theorem 2 Let F, G be elliptical, $F = \mathrm{Ell}(\mu_F, BB', r)$,
$G = \mathrm{Ell}(\mu_G, BB', r)$. Then based on a 50:50 independent sample
from F and G the DDα-procedure will asymptotically yield the linear separator
that corresponds to the bisecting line of the DD-plot.

Proof. If F and G are spherically symmetric, they satisfy the premise of
the previous theorem. A common affine transformation of F and G does not
change the DD-plot. □
5 Simulation study



The DDα-procedure has been implemented on a standard PC in an R environment.
To explore its specific strengths we apply it to simulated as well as to real
data. The same data have been analyzed with several classifiers in the
literature. In this section results on simulated data are presented regarding
the average misclassification rate (AMR) of nine procedures besides the
DDα-classifier (Section 5.1). Then the speed of the DDα-procedure is quantified
(Section 5.2). The following Section 6 covers the relative performance of the
DDα- and other classifiers on several benchmark data sets.

5.1 Comparison of performance 

To simplify the comparison with known classifiers, we use the same simulation
settings as in [16]. These are supervised classification tasks with two equally
sized training classes. Data are generated by ten pairs of distributions
according to Table 1. Here N and Exp denote the Gaussian and exponential
distributions, respectively, and

$$\mathrm{MixN}(\mu; \sigma_1, \sigma_2) = \begin{cases} -\sigma_1 \cdot |N(0,1)| & \text{with probability } 1/2, \\ \sigma_2 \cdot |N(0,1)| + \mu & \text{with probability } 1/2. \end{cases}$$

The DDα-classifier is contrasted with the following nine classifiers: linear
discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-nearest
neighbors classification (k-NN), maximum depth classification based on
Mahalanobis (MM), simplicial (MS), and halfspace (MH) depth, and
DD-classification with the same depths (DM, DS and DH, correspondingly). For
more details about the data and the procedures as well as for some motivation
the reader is referred to [16].

All simulations of [16] are recalculated following their paper as closely as
possible. The LDA, QDA and k-NN classifiers are computed with the R-packages
"MASS" and "class", where the parameter k of the k-NN-classifier is selected by
leave-one-out cross-validation over a relatively wide range. The Mahalanobis,
simplicial, and halfspace depths have been determined by exact calculations
with the R-package "depth".

The zonoid depth has been exactly computed by the algorithm in [7]. Recall
that in dimension two such calculations can be efficiently done by a circular
procedure and note that the problem of prior probabilities has been avoided
by choosing samples of equal size for both classes. For the DD-classifiers a
polynomial line (up to degree three) is determined to discriminate in the
two-dimensional DD-plot, a tenfold cross-validation is employed to choose the
optimal degree of the polynomial, a smoothing constant t = 100 is selected in
the logistic function, and the DD-plot is never rotated. Each experiment
includes a training phase and an evaluation phase: From the given pair of
distributions 400 observations (200 of each class) are generated to train the
classifier, and 1000 (500 of each) observations to evaluate its AMR. For each
distribution pair and each classifier 100 experiments are performed, and the
resulting sample of AMRs is visualized as a box-plot; see Figures 3 to 7.

Table 1: Distributional settings used in the simulation study

Alternative             1st class                         2nd class

Normal                  N([8],[li])                       N([l],[li])
location

Normal                  N([8],[ii])                       N([i],[iA])
location-scale

Cauchy                  Cauchy([8],[iJ])                  Cauchy([l],[U])
location

Cauchy                  Cauchy([8],[iJ])                  Cauchy([l],[|4] )
location-scale

Normal                  Learning sample: 90% as s1,       as s1
contaminated            10% from N([ig],[ii]).
location                Testing sample: as s1

Normal                  Learning sample: 90% as s2,       as s2
contaminated            10% from N([J8],[ii]).
location-scale          Testing sample: as s2

Exponential             (Exp(1), Exp(1))                  (Exp(1) + 1, Exp(1) + 1)
location

Exponential             (Exp(1), Exp(1/2))                (Exp(1/2) + 1, Exp(1) + 1)
location-scale

Asymmetric              (MixN(0; 1, 2), MixN(0; 1, 4))    (MixN(1; 1, 2), MixN(1; 1, 4))
location

Normal-                 N([8],[5?])                       (Exp(1), Exp(1))
exponential

Figure 3: Normal location (left) and location-scale (right) alternatives

As we have discussed at the end of Section 2, with depths like the simplicial,
halfspace and zonoid depth the problem of outsiders arises. An outsider is, in
the DD-plot, represented by the origin. A simple approach is to assign the
outsiders randomly to the two classes. Throughout our simulation study we have
chosen the random assignment rule, which results in a kind of worst-case AMR.
Observe that this choice of assignment rule discriminates against the
procedures that yield outsiders and favors those that do not, in particular
LDA, QDA, MM and DM for all distribution settings.

The principal results of the simulation study are collected in Figures 3 to 7.
Under the normal location-shift model (Figure 3, left) all classifiers behave
satisfactorily, and the DDα-classifier performs well among them. However, LDA,
QDA, MM and DM show slightly better results since they do not have to cope
with outsiders like the other depth-based procedures.

Also under the normal location-scale alternative (Figure 3, right) the
DDα-classifier performs rather well, like all DD-classifiers. A slightly worse
performance of the DDα-classifier is observed when discriminating the Cauchy
location alternative (Figure 4, left), but it is still close to the
DD-classifiers. This can be attributed to the lower robustness of the zonoid
depth. However, when scaling enters the game (Cauchy location-scale
alternative, Figure 4, right), the DDα-classifier again performs quite
satisfactorily. The same picture arises when considering contaminated normal
settings (Figure 5, both left and right). Under a location alternative, the
DDα-classifier is a bit worse than the DD-classifiers, while it slightly
outperforms them in a location-scale setting.

The relative robustness of the DDα-classifier may be explained by two of
its features: First, it maps the original data points to a compact set, the
q-dimensional unit hypercube. Second, for classification in the unit hypercube
it employs the α-procedure, which, by choosing a median angle in each step, is
rather insensitive to outliers.

Under exponential alternatives (Figure 6) the DDα-classifier shows excellent
performance, which is even similar to that of the k-NN for both location and
location-scale alternatives. Its results for the asymmetric location
alternative (Figure 7, left) are somewhat ambiguous, though still close to
those of the DD-classifiers. Concerning the normal-exponential alternative
(Figure 7, right) the DDα-classifier performs distinctly better than the others
considered here.

On the basis of the simulation study we conclude: The DDα-classifier (1)
performs quite well under various settings of elliptically distributed
alternatives, (2) is rather robust to outlier-prone data, and (3) shows a
distinctly good behavior under asymmetrically distributed alternatives and when
the two classes originate from different families of distributions.

5.2 Speed of the DDα-procedure

To estimate the speed of the DDα-classification we have measured the total
training and classification time under two simulation settings, a location and
a location-scale alternative concerning d-variate normals (see the header of
Table 2), with various values of dimension d and of training class size n. An
experiment consists of a training phase based on two samples (of total size n)
and an evaluation phase, where 2500 points (1250 from each distribution) are
classified and the AMR is determined. Each experiment is performed 100 times.
All these computations have been conducted on a single core of a Core i7-2600
processor (3.4 GHz) with sufficient physical memory.

Table 2 exhibits the average computation times (in seconds, with the standard
deviations in parentheses) under the two distributional settings and for
different d and n. As can be seen from the table, the DDα-classifier is very
fast, in the learning phase as well as in classifying large amounts of data.
However, computation times increase considerably with the number of training
points, which is due to the many calculations of zonoid depth needed. With
dimension d computation time grows more slowly, which may be explained as
follows. With increasing dimension of the data space, more points come to lie
on the convex hull (thus having depth = 1/n) or outside it (in particular those
of the other class, thus having depth = 0). The algorithm from [7] computes the
depth of such points much faster than that of points having larger depths.

6 Benchmark studies 

Concerning real data, we take benchmark examples from [16], [4], and [5] to
compare the performance of the DDα-classifier with respect to average
misclassification rate (Section 6.1). In addition we use four real data sets
from [22] to contrast the DDα-classifier with the support vector machine (SVM)
of [24] regarding both performance and time (Section 6.2).

Table 2: Computing times of DDa-classification, in seconds; number n in each
class, dimension d.

$N(0_d, I_d)$ vs. $N(0.25 \cdot 1_d, I_d)$

   n      d = 5        d = 10       d = 15       d = 20
  200     0.14         1.55         1.89         2.24
          (0.00014)    (0.00014)    (-)          (-)
  500     1.04         10.37        12.58        14.14
          (0.00046)    (0.00052)    (0.00062)    (-)
 1000     5.33         42.54        53.66        59.18
          (0.0012)     (0.0014)     (0.0017)     (-)

$N(0_d, I_d)$ vs. $N((0.25, 0'_{d-1})', 5 \cdot I_d)$

   n      d = 5        d = 10       d = 15       d = 20
  200     0.15         1.62         1.94         2.2
          (0.00014)    (0.00016)    (0.00021)    (0.00027)
  500     1.09         11.33        14.44        15.18
          (0.00044)    (0.00059)    (0.00079)    (0.0010)
 1000     5.24         47.63        67.22        74.15
          (0.0011)     (0.0016)     (0.0022)     (0.0026)


Table 3: Overview of benchmark examples; dimension (d), classes (q), training
points (# train), classified points (# class), total data (# data).

No.  Dataset              Results       q    d   # train  # class  # data
 1   Biomedical           Tables 5, 6   2    4     150       44      194
                          Table 4       2    4     100       94      194
 2   Blood Transfusion    Table 4       2    3     374      374      748
                          Table 6       2    3     500      248      748
 3   Diabetes (1)         Table 6       3    5     100       45      145
 4   Diabetes (2)         Table 7       2    8     767        1      768
 5   Ecoli                Table 7       3    7     271        1      272
 6   Glass                Tables 5, 6   2    5     100       46      146
                          Table 7       2    9     145        1      146
 7   Hemophilia           Table 6       2    2      50       25       75
 8   Image Segmentation   Table 4       2   10     500      160      660
 9   Iris                 Table 7       3    4     149        1      150
10   Synthetic            Tables 5, 6   2    2     250     1000     1250




6.1 Benchmark comparisons with nonparametric classifiers

As our benchmark examples are well known, we refer to the literature for their 
detailed description and restrict ourselves to mentioning the dimension d, the 
number of classes q, the total number of points in the training classes (# train), 
and the number of points classified (# class); see Table 3.

Tables 4, 5 and 6 exhibit the performance (in terms of AMR, with standard
errors in parentheses) of the DDa-classifier together with the performance of
the different classifiers investigated in [16], [4] and [5], based on the
respective benchmark data. When applying the DDa-classifier, an auxiliary
procedure has to be chosen by which outsiders are treated. In our benchmark
study we employ several such procedures.

In Table 4 the DDa-procedure is contrasted with the real data results in
[16]. Here we use the same settings as in Section 5.1 and classify the
outsiders on a random basis. All results in Table 4 have been recalculated.

Table 4: Performance comparison with DD-classifier

Dataset              LDA      QDA      k-NN     MM       MH       DM       DH       DDa
Biomedical           17.05    13.05    14.32    27.14    18.00    12.25    17.48    24.59
                     (0.49)   (0.38)   (0.45)   (0.6)    (0.49)   (0.4)    (0.51)   (0.63)
Blood Transfusion    29.49    29.11    29.74    32.56    30.47    26.82    28.26    32.27
                     (0.08)   (0.13)   (0.13)   (0.29)   (0.3)    (0.19)   (0.19)   (0.25)
Image Segmentation   8.17     9.44     5.59     9.12     11.87    9.54     13.98    43.58
                     (0.2)    (0.19)   (0.19)   (0.23)   (0.25)   (0.2)    (0.29)   (0.34)

As Table 4 shows, the performance of our new classifier is mostly worse than
that of the classifiers considered in [16]. Only in the Blood Transfusion
case is the AMR of comparable size. However, in this comparison the possible
presence and treatment of outsiders plays a decisive role. Observe that the
authors of [16], in their procedures MH and DH, use the random Tukey depth
[3] to approximate the halfspace depth of a data point in dimension three and
more. But the random Tukey depth generally overestimates the halfspace depth,
so that some of the outsiders remain undetected. This implies that, in the
procedures MH and DH, considerably fewer points (we observed around 16%, 4%
and 11%, respectively) are treated as outsiders and assigned on a random
basis.

In fact, as determined exactly by calculating the zonoid depth, the rate of
outsiders in the Biomedical Data (d = 4) totals some 35%, in the Blood
Transfusion Data (d = 3) about 11%, and in the Image Segmentation Data
(d = 10) about 86%. This is in line with our expectation: the higher the
dimension of the data, the higher the outsider rate. In contrast to the MH
and DH procedures, the DDa-procedure detects all outsiders and, in the
comparison of Table 4, assigns them randomly. Obviously its performance can
be improved with a proper non-random procedure of outsider assignment. The
subsequent benchmark comparisons include several such procedures.
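The effect of random assignment can be made concrete with a small calculation. The sketch below is our own illustration and rests on one assumption that the text does not spell out: each outsider is assigned to one of the q classes uniformly at random, so that it is misclassified with probability 1 - 1/q. The function name is hypothetical.

```python
def random_assignment_error(outsider_rate, n_classes=2):
    """Expected misclassification contributed by outsiders alone when each
    outsider is assigned to one of n_classes uniformly at random."""
    return outsider_rate * (1.0 - 1.0 / n_classes)


# Approximate outsider rates quoted in the text.
outsider_rates = {"Biomedical": 0.35,
                  "Blood Transfusion": 0.11,
                  "Image Segmentation": 0.86}

for name, rate in outsider_rates.items():
    floor = 100.0 * random_assignment_error(rate)
    print(f"{name}: about {floor:.1f}% error from outsiders alone")
```

Under this assumption the outsiders of the Image Segmentation Data would by themselves account for an error of about 43%, which is of the same order as the AMR reported for the DDa-procedure on these data in Table 4.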

Dutta and Ghosh [4] introduce classification based on projection depth and
compare it with several variants of maximum Mahalanobis depth (MD). The same
authors [5] propose an L_p-depth classifier (with optimized p) and contrast
it with two types of MD. To compare the DDa-classifier on a par with [4], [5]
we implement the following rules for handling outsiders: First,
k-nearest-neighbor rules are used with various k and either Euclidean or
Mahalanobis distance, the latter with moment or, alternatively, MCD
estimates. Second, maximum Mahalanobis depth is employed, again based on
moment or MCD estimation. As the k-NN results of the benchmark examples do
not vary much with k, we restrict ourselves to k = 1. Consequently, five
different rules for treating outsiders remain for comparison. Tables 5 and 6
exhibit the performance of the DDa-classifier vs. the projection-depth
classifiers of [4] and the L_p-depth classifiers of [5], respectively,
regarding the benchmark examples investigated in these papers.
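Two of these outsider rules can be sketched in a few lines. The code below is a minimal illustration under simplifying assumptions, not the implementation behind Tables 5 and 6: it covers only Euclidean 1-NN and maximum Mahalanobis depth with moment estimates, it is restricted to two-dimensional data so that the covariance matrix can be inverted by hand, and it uses the common definition of Mahalanobis depth as 1 / (1 + squared Mahalanobis distance). The MCD variants are not shown, and all function names are hypothetical.

```python
def nearest_neighbor_label(x, train):
    """Euclidean 1-NN: label of the training point closest to x.
    `train` is a list of (point, label) pairs."""
    nearest = min(train,
                  key=lambda pair: sum((a - b) ** 2 for a, b in zip(x, pair[0])))
    return nearest[1]


def moment_estimates_2d(points):
    """Sample mean and covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_depth_2d(x, mean, cov):
    """Mahalanobis depth 1 / (1 + squared Mahalanobis distance)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy              # assumed non-zero
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return 1.0 / (1.0 + d2)


def max_depth_label(x, classes):
    """Assign x to the class in which its Mahalanobis depth is largest.
    `classes` maps label -> list of 2-D training points."""
    depths = {label: mahalanobis_depth_2d(x, *moment_estimates_2d(points))
              for label, points in classes.items()}
    return max(depths, key=depths.get)


classes = {0: [(0, 0), (1, 0), (0, 1), (1, 1)],
           1: [(5, 5), (6, 5), (5, 6), (6, 6)]}
train = [(p, label) for label, pts in classes.items() for p in pts]
outsider = (4.5, 4.5)                        # lies outside both convex hulls
print(nearest_neighbor_label(outsider, train), max_depth_label(outsider, classes))
```

Both rules only come into play for points whose depth vector is zero; all other points are classified in the depth space as before.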

Table 5: Performance comparison with projection depth classifier

                                                                         DDa-classifier
                                                                         1-NN                          Mahalanobis depth
                                                                         Eucl.     Mah. dist.
Dataset      MD (SS)   MD (MS)   MDs (SS)  MDs (MS*) PD (SS)   PD (MS)   dist.     Mom.      MCD       Mom.      MCD
Synthetic    13.00     11.60     10.30     10.40     10.00     10.50     12.10     11.90     12.00     11.90     12.00
Glass        26.59     26.14     24.92     24.43     25.70     25.24     29.45     25.79     24.73     30.09     35.06
             (0.25)    (0.25)    (0.25)    (0.25)    (0.34)    (0.33)    (0.20)    (0.17)    (0.18)    (0.18)    (0.22)
Biomedical   12.44     12.04     14.25     14.03     12.37     12.18     13.51     19.59     17.90     12.91     15.23
             (0.13)    (0.12)    (0.13)    (0.14)    (0.14)    (0.13)    (0.14)    (0.18)    (0.17)    (0.14)    (0.16)


Table 6: Performance comparison with L_p-depth classifier

                                                       DDa-classifier
                   MD                L_p D             1-NN                       Mahalanobis depth
                                                       Eucl.    Mah. dist.
Dataset            Mom.     MCD      Mom.     MCD      dist.    Mom.     MCD      Mom.     MCD
Synthetic          10.20    10.60    9.60     10.70    12.10    11.90    12.00    11.90    12.00
Hemophilia         15.84    17.13    15.39    16.43    16.63    17.98    18.36    18.65    19.39
                   (0.30)   (0.32)   (0.32)   (0.32)   (0.20)   (0.20)   (0.19)   (0.22)   (0.22)
Glass              26.80    24.80    27.64    24.75    30.13    28.37    26.63    32.88    36.82
                   (0.26)   (0.29)   (0.29)   (0.26)   (0.19)   (0.22)   (0.20)   (0.22)   (0.23)
Biomedical         12.35    14.48    12.68    15.11    13.74    22.09    20.89    14.34    17.28
                   (0.14)   (0.15)   (0.15)   (0.15)   (0.09)   (0.16)   (0.14)   (0.12)   (0.14)
Diabetes           8.22     11.49    9.39     11.92    10.77    18.36    18.33    12.70    15.90
                   (0.18)   (0.22)   (0.21)   (0.27)   (0.12)   (0.18)   (0.20)   (0.18)   (0.19)
Blood Transfusion  22.75    22.17    22.30    22.06    23.11    22.73    22.92    22.59    22.17
                   (0.07)   (0.08)   (0.07)   (0.07)   (0.06)   (0.06)   (0.06)   (0.06)   (0.06)


The last five columns of Tables 5 and 6 report the AMR (standard deviations
in parentheses) of the DDa-classifier when one of the five outsider
treatments is chosen. The remaining columns are adopted as they stand in [4]
and [5].

Regarding the Biomedical Data, the authors of [4] do not specify the sample
sizes they use in training and testing. For the DDa-classifier, we select 100
observations of the larger class and 50 of the smaller class to form the
training sample; the remaining observations constitute the testing sample. As
seen from Table 5, the DDa-classifier shows results similar to the
projection-depth classifier (except with the Synthetic Data), while the
performance of the outsider-handling methods varies depending on the type of
the data. Specifically, with the Glass Data 1-NN based on the Mahalanobis
distance (both with the moment and the robust estimate) performs best in
handling outsiders. On the other hand, with the Biomedical Data the same
approach performs quite poorly, while treating outsiders with
moment-estimated Mahalanobis depth or Euclidean 1-NN yields the best results.

Table 6 presents a similar comparison of the DDa-classifier with the
L_p-depth classifier of [5]. The same approaches are included to treat
outsiders. In all six benchmark examples the DDa-classifier generally
performs worse than the best L_p-depth classifier. However, its performance
depends substantially on the chosen treatment of outsiders. In all examples
the AMR of the DDa-classifier comes close to that of the L_p-depth
classifier, provided the outsider treatment is properly selected. On the
Hemophilia Data, e.g., Euclidean 1-NN should be chosen. On the Glass Data a
1-NN outsider treatment with robust Mahalanobis distance performs relatively
best, etc. On the Blood Transfusion Data all outsider-handling approaches
show equally good performance.

6.2 Benchmark comparisons with SVM 

The support vector machine (SVM) is a powerful solver of the classification
problem and has been widely used in applications. However, unlike the
DDa-classifier, the SVM is a parametric approach, as in applying it certain
parameters have to be adjusted: the box constraint and the kernel parameters.
The AMR performance of the SVM depends heavily on the choice of these
parameters. In applications, optimal parameters are selected by some form of
cross-validation, which requires extensive calculations. Once these
parameters have been optimized, SVM-classification is usually very fast and
precise.

In comparing the SVM with the DDa-procedure, this step of parameter
optimization has to be accounted for in some way. Here we take a two-fold
view of the comparison problem: Two values of the AMR are calculated and
reported, first the best possible AMR when the parameters have been optimally
selected, second the average AMR when the parameters vary over specified
ranges. As ranges we choose the intervals between the smallest and the
largest number that arise as an optimal value in one of our benchmark data
examples. This seems to us a fair and, regarding the parameter ranges, rather
conservative approach.
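The two-fold evaluation can be sketched as follows. The optimal parameter pairs are those listed in the last column of Table 7; everything else is an assumption made for illustration only: the grid is taken to be equally spaced on a log10 scale with step 0.25 (the reported optima are close to such values), and `toy_error` is a hypothetical stand-in for the leave-one-out error of an SVM fitted with the given kernel parameter and box constraint C.

```python
import math


def log_grid(low, high, step=0.25):
    """Values 10**e for exponents e from log10(low) to log10(high)."""
    lo, hi = math.log10(low), math.log10(high)
    count = int(round((hi - lo) / step))
    return [10 ** (lo + i * step) for i in range(count + 1)]


def best_and_average(error_fn, gammas, costs):
    """Smallest and mean error of error_fn(gamma, cost) over the grid."""
    errors = [error_fn(g, c) for g in gammas for c in costs]
    return min(errors), sum(errors) / len(errors)


# Optimal (kernel parameter, C) pairs as listed in the last column of Table 7.
optima = [(0.056, 1), (5.62, 1.78), (0.56, 1), (0.056, 10), (0.1, 3.16)]

# Ranges: smallest to largest value that is optimal for some data set.
gammas = log_grid(min(g for g, _ in optima), max(g for g, _ in optima))
costs = log_grid(min(c for _, c in optima), max(c for _, c in optima))


def toy_error(gamma, cost):
    """Hypothetical error surface; replace by a real cross-validated SVM error."""
    return 0.05 + 0.1 * abs(math.log10(gamma)) + 0.05 * abs(math.log10(cost) - 0.5)


best, average = best_and_average(toy_error, gammas, costs)
print(f"best error {best:.3f}, average error {average:.3f}")
```

The first returned value corresponds to the optimally tuned SVM, the second to the average AMR over the parameter ranges.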

As benchmark four well-known data sets are employed in the sequel: Dia- 
betes, Ecoli, Glass, and Iris Data, being taken from the UCI machine learning 
repository pQ. The DDa-classifier is calculated with the same outsider treat- 
ments as above. For the SVM-classifier we use radial basis function kernels 
as implemented in LIBSVM, with the R package "e1071" as an R interface.
Leave-one-out cross-validation is employed for performance estimation of all
classifiers. The computation has been done on the same PC as in Sec. 5.2.
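Leave-one-out estimation of the error rate can be sketched generically. In the following minimal example the functions `fit_1nn` and `predict_1nn` are hypothetical placeholders (a Euclidean 1-NN rule) for whichever classifier is evaluated; the toy data are invented for illustration.

```python
def leave_one_out_error(data, fit, predict):
    """Fraction of points misclassified when each point in turn is held out.
    `data` is a list of (point, label) pairs."""
    errors = 0
    for i, (x, label) in enumerate(data):
        model = fit(data[:i] + data[i + 1:])     # train on the other n - 1 points
        if predict(model, x) != label:
            errors += 1
    return errors / len(data)


def fit_1nn(train):
    """1-NN has no training step; the 'model' is the training sample itself."""
    return train


def predict_1nn(model, x):
    """Label of the training point closest to x in Euclidean distance."""
    nearest = min(model,
                  key=lambda pair: sum((a - b) ** 2 for a, b in zip(x, pair[0])))
    return nearest[1]


toy = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0),
       ((3.0, 3.0), 1), ((3.1, 2.9), 1), ((2.9, 3.2), 1)]
print(leave_one_out_error(toy, fit_1nn, predict_1nn))
```

Each of the n rounds trains on n - 1 points and classifies a single one, which matches the "# train" and "# class" entries of Table 3 for the data sets reported in Table 7.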

The results on the best possible AMR and the average AMR are collected in
Table 7, together with computation times and proportions of outsiders. The
Iris Data appears twice in the table: first the original data are used, and
second the same data after a preprocessing step. The preprocessing consists
in the exclusion of an obvious outlier in the DD-plot that was identified by
visual inspection of the plot.

The overall analysis of Table 7 shows that, even with an arbitrary technique
for handling outsiders, the DDa-classifier mostly performs not much worse
than an SVM whose parameters have been optimally chosen. In contrast, if the
SVM is employed with non-optimized parameters, its AMR can be considerably
larger than that of the DDa-classifier. Average errors over the relevant
intervals are given in parentheses.

Table 7: Comparison with support vector machine. In the last row of each
block, the DDa columns give the percentage of outsiders and the SVM column
gives Ker.Par/C.

                            DDa-classifier                                    SVM
                            1-NN                          Mahalanobis depth
                            Eucl.     Mah. dist.
Dataset        Legend       dist.     Mom.      MCD       Mom.      MCD       Opt. (Avr.)
Diabetes       Error        28.26     30.6      34.51     24.35     31.77     23.18 (44.99)
               Time:train   16.63     16.62     16.59     16.58     17.39     0.05
               Time:test    0.033     0.009     0.0092    0.0035    0.0037    0.0023
               % outsiders  62.24     62.24     62.24     62.24     63.54     0.056/1
Ecoli          Error        10.29     11.4      12.13     12.13     16.18     3.68 (47.43)
               Time:train   0.26      0.26      0.26      0.26      0.26      0.0077
               Time:test    0.014     0.0026    0.0032    0.001     0.00044   0.0019
               % outsiders  75        75        75        75        75        5.62/1.78
Glass          Error        18.49     26.03     31.51     34.93     34.93     21.23 (51.96)
               Time:train   0.31      0.32      0.31      0.32      0.32      0.0082
               Time:test    0.0083    0.0019    0.0016    0.00014   0.00055   0.0024
               % outsiders  95.89     95.89     95.89     95.89     95.89     0.56/1
Iris           Error        37.33     37.33     37.33     36        46.67     4.67 (66.67)
               Time:train   0.07      0.07      0.07      0.07      0.07      0.0051
               Time:test    0.0046    0.0018    0.0013    0.00033   0.00047   0.0017
               % outsiders  50        50        50        50        50        0.056/10
Iris (Prepr.)  Error        3.36      3.36      4.03      2.68      13.42     2.68 (66.44)
               Time:train   0.07      0.07      0.07      0.07      0.07      0.0052
               Time:test    0.0046    0.0011    0.0013    0.0006    0.00027   0.0017
               % outsiders  51.68     51.68     51.68     51.68     51.68     0.1/3.16

The times needed to classify a new object (also given in Table 7) are quite
comparable. But as the parameters of the SVM have to be adjusted first by
running it many times for cross-validation, the computational burden of its
training phase is much higher than that of the DDa-classifier, which has to
be run only once. Recall that the latter is fully nonparametric. For example,
in our implementation it took 875 seconds to determine approximately optimal
values of the SVM parameters for the Diabetes Data, and similar times for the
other data sets.

7 Discussion and conclusions 

A new classification procedure has been proposed that is completely
nonparametric. The DDa-classifier transforms the d-variate data to a
q-variate depth plot and performs linear classification in an extended depth
space. The depth transformation is done by the zonoid depth, and the final
classification by the a-procedure. The procedure has attractive properties:
First, it proves to be very fast and efficient in the training as well as in
the testing phase; in this it clearly outperforms existing alternative
nonparametric classifiers, and also, regarding the training phase, the
support vector machine. Second, in many settings of elliptically distributed
alternatives, its average misclassification rate is of similar size to that
of the competing classifiers. Moreover, it is rather robust to outlier-prone
data. As a nonparametric approach, the new procedure behaves particularly
well under asymmetrically distributed alternatives and when the two classes
originate from different families of distributions. Unlike many competitors,
it considers all classes in the multi-class classification problem even when
performing binary classification. Also, several theoretical properties of the
DDa-procedure have been derived: It operates in a rather simple way if the
data generating processes are elliptical, and a Bayes rule holds if q = 2 and
the two classes are mirror symmetric.

The zonoid depth has many theoretical and computational advantages: Most
important here, it is efficiently computed also in higher dimensions.
However, as it takes its maximum at the mean of the data, the zonoid depth
lacks robustness. Nevertheless, the DDa-classifier shows a rather robust
behavior. Its relative robustness can be explained as follows: The original
data points are mapped to a compact set, the q-dimensional unit hypercube,
and then classified by the a-procedure. The latter, by choosing a median
angle in each step, is rather insensitive to outliers.

Points that are not within the convex hull of at least one training set must
be treated specially, as their depth representation is zero. To classify
these so-called outsiders several approaches have been used and compared.
Instead of assigning them randomly, which puts the DDa-procedure at a
disadvantage, as it does other procedures based on halfspace or simplicial
depth, one should classify outsiders by 1-NN with some distance or by a
properly chosen maximum depth rule.
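For two-dimensional data the detection of outsiders can be sketched with elementary geometry. The code below is only an illustration under stated assumptions, not the method used in the paper, where outsiders are identified through their zero zonoid depth: it builds each class's convex hull with Andrew's monotone chain algorithm and assumes that every class contains at least three points that are not collinear. All function names are hypothetical.

```python
def cross(o, a, b):
    """Cross product of the vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_hull(x, hull):
    """True if x lies inside or on the boundary of a counter-clockwise convex polygon."""
    return all(cross(hull[i], hull[(i + 1) % len(hull)], x) >= 0
               for i in range(len(hull)))


def is_outsider(x, classes):
    """x is an outsider if it lies outside the convex hull of every class."""
    return not any(in_hull(x, convex_hull(points)) for points in classes)


classes = [[(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)],
           [(5, 5), (7, 5), (6, 7)]]
print(is_outsider((1, 0.5), classes), is_outsider((3.5, 3.5), classes))
```

In higher dimensions the same membership question is usually posed as a linear program; points flagged in this way are then passed to one of the outsider treatments.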

To contrast the DDa-procedure with an SVM approach, a novel way of
comparison has been taken: An optimal performance of an SVM has been
evaluated, which arises under an optimal choice of the parameters, as well as
an average performance, where the parameters vary over specified conservative
intervals. It turned out that, even with an arbitrary handling of outsiders,
the DDa-classifier mostly performs not much worse than an SVM whose
parameters have been optimally chosen. If the SVM is employed with some
non-optimized parameters, the error rate can be considerably larger than that
of the DDa-classifier.

More investigations are needed on the behavior of the DDa-classifier on 
skewed or fat-tailed data, the - possibly adaptive - choice of outsider treatments, 
and the use of alternative notions of data depth. These are intended for future 
research. 

Acknowledgements: Thanks are due to Rainer Dyckerhoff for his constructive
remarks on the paper as well as to the other participants of the Witten
Workshop on "Robust methods for dependent data" for discussions.